{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 13 - Simulations and hypotheses\n",
"\n",
"We will start this lab by simulating some data using the [NumPy](http://www.numpy.org) package. The NumPy package is used by Pandas, so should have been installed when we installed Pandas. However, we still need to import it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simulating Data\n",
"\n",
"#### Smokers and non-smokers in New York City\n",
"\n",
"The smoking rate in New York City was 11.5% in 2016. This statistic means that in theory if we picked 100 New Yorkers at random, we expect 11.6 of them to smoke. But what happens in practice if we select 100 random New Yorkers? We are going to simulate this scenario using code.\n",
"\n",
"First we define our population:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"population = [\"Smoker\",\"Non-smoker\"]\n",
"pop_prob = [0.116,1-0.116]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This code defines our population as having two groups `Smoker` and `Non-smoker`, with the probabilities 0.116 and 1-0.116 = 0.884, respectively. \n",
"\n",
"We can generate a random sample of 100 people from our population with the code `np.random.choice(population,p=pop_prob,size=100)`. Type and run it below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code has simulated 100 people and labeled each as a `Smoker` or `Non-smoker` according to the probabilities we gave. Try re-running this code. What happens?\n",
"\n",
"Also notice that the word `array` appears at the beginning of the output. An array is similar to a list or a Series, in that it holds many different values of the same type (integers, strings, etc.). This particular array is an `ndarray`, which is one of the kinds of arrays produced by `NumPy`. We have to convert the array into a Pandas Series before we can use any Pandas function on it.\n",
"\n",
"Next we want to count the number of smokers, which we do by:\n",
"\n",
"1. saving the array into a variable\n",
"2. converting the array into a Pandas Series using `Pd.Series(name_of_array)`\n",
"3. applying `value_counts()` to the Series to count the number of smokers and non-smokers\n",
"\n",
"Can figure out how to write this code?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Answer:
\n",
" sample_array = np.random.choice(population,p=pop_prob,size=100)\n",
"pd.Series(sample_array).value_counts()
\n",
" \n",
"\n",
"To understand the variation in the number of smokers, we will take a bunch of random samples and make a histogram of the counts of smokers in each sample. To do this, we will need to extract the number of smokers from the `value_counts()` output.\n",
"\n",
"Save the output of `value_counts()` above in the variable `counts`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the number of smokers in `counts`, use the code `counts[\"Smoker\"]`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Lab 11, we sampled multiple times from our dataset and made a histogram of the mean, median, or variance of the samples. In this lab, we'll simulate multiple samples and make a histogram of the number of smokers in each sample.\n",
"\n",
"The pseudo-code for how to do this is:\n",
"\n",
"create a new list for the counts\n",
"loop 200 times:\n",
" simulate a new sample\n",
" count the number of smokers in the sample\n",
" add (append) this smoker count to your list\n",
"turn the list into a Pandas series and plot a histogram from it
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Pattern:
\n",
"new_list = []\n",
"for i in range(num_times_to_loop):\n",
" sample = np.random.choice(population,p=pop_prob,size=sample_size)\n",
" num_in_category = sample[\"category_to_count\"]\n",
" new_list.append(num_in_category)\n",
"pd.Series(new_list).hist()\n",
"
\n",
" \n",
"\n",
"What do you notice about the histogram? Run your code again. How does the histogram change (or not change)?\n",
"\n",
"What happens when you increase the number of simulations?\n",
"\n",
"#### Birth rate of boys and girls\n",
"\n",
"In most high-income countries, the percentage of births that are boys is 51.2%. Consider samples of 100 births. How much variation is there in the number of boys born? Generate 250 samples of size 100 and plot a histgram of the number of boys born in each sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hypotheses\n",
"\n",
"A *statistical hypothesis* is a specific, testable assumption about a population that is either true or false. Next class we will see how to formally test a hypothesis. For this lab, we will focus on making our hypotheses specific enough that we can get some kind of answer from the data.\n",
"\n",
"Suppose we want to know if taxis with more passengers take longer trips. We can rephrase this as a statement that is either true or false:\n",
"\n",
"*Taxis with more passengers take longer trips.*\n",
"\n",
"Presumably we can look at trips with lots of passengers and trips with few passengers, and compare how long these trips are. But what is the cut-off between between lots and few passengers? What exactly should we compare: the mean trip length? The median trip length? Something else?\n",
"\n",
"We will encode our choices to the above questions in the hypothesis:\n",
"\n",
"*Taxis with more than 1 passenger take longer trips on average than taxis with more than 1 passenger.*\n",
"\n",
"Now we can test this hypothesis. First load the green taxi trip dataset into a dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, make a dataframe containing only the green taxi trips with 1 passenger."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Pattern:
\n",
"filter_variable = df[\"column_name\"] == value\n",
"new_df = df[filter_variable] \n",
"
\n",
" \n",
"\n",
"What is the mean trip distance for these green taxi trips with only 1 passenger?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, make a dataframe containing only green taxi trips with 2 or more passengers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the mean trip distance for the green taxi trips with 2 or more passengers? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is there any difference between the mean length of taxi trips with 1 passenger and taxi trips with 2 or more passengers? Do you think this difference is significant?\n",
"\n",
"#### Challenges:\n",
"- Is there any difference between the mean length of taxi trips that begin at JFK or Newark airports? (RateCodeID is 2 or 3) and the other taxi trips?\n",
"- Is there any difference between the median length of taxi trips paid by credit card (Payment_type is 1) and trips paid by cash (Payment_type is 2)?\n",
"- Is there any difference in the mean number of passengers in trips taken at night from those taken during the day?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}